The impact of using large training data set KDD99 on classification accuracy
نویسندگان
چکیده
9 This study investigates the effects of using a large data set on supervised machine learning classifiers in the domain of Intrusion Detection Systems (IDS). To investigate this effect 12 machine learning algorithms have been applied. These algorithms are: (1) Adaboost, (2) Bayesian Nets, (3) Decision Tables, (4) Decision Trees (J48), (5)Logistic Regression, (6) Multi-Layer Perceptron, (7) Naive Bayes, (8) OneRule, (9)Random Forests, (10) Radial Basis Function Neural Networks, (11) Support Vector Machines (two different training algorithms), and (12) ZeroR. A well-known IDS benchmark dataset, KDD99 has been used to train and test classifiers. Full training data set of KDD99 is 4.9 million instances while full test dataset is 311,000 instances. In contrast to similar previous studies, which used 0.08%–10% for training and 1.2%–100% for testing, this study uses full training dataset and full test dataset. Weka Machine Learning Toolbox has been used for modeling and simulation. The performance of classifiers has been evaluated using standard binary performance metrics: Detection Rate, True Positive Rate, True Negative Rate, False Positive Rate, False Negative Rate, Precision, and F1-Rate. To show effects of dataset size, performance of classifiers has been also evaluated using following hardware metrics: Training Time, Working Memory and Model Size. Test results shows improvements in classifiers in standard performance metrics compared to previous studies. 10
منابع مشابه
IRDDS: Instance reduction based on Distance-based decision surface
In instance-based learning, a training set is given to a classifier for classifying new instances. In practice, not all information in the training set is useful for classifiers. Therefore, it is convenient to discard irrelevant instances from the training set. This process is known as instance reduction, which is an important task for classifiers since through this process the time for classif...
متن کاملEntropy Based Fuzzy Rule Weighting for Hierarchical Intrusion Detection
Predicting different behaviors in computer networks is the subject of many data mining researches. Providing a balanced Intrusion Detection System (IDS) that directly addresses the trade-off between the ability to detect new attack types and providing low false detection rate is a fundamental challenge. Many of the proposed methods perform well in one of the two aspects, and concentrate on a su...
متن کاملA New Framework for Distributed Multivariate Feature Selection
Feature selection is considered as an important issue in classification domain. Selecting a good feature through maximum relevance criterion to class label and minimum redundancy among features affect improving the classification accuracy. However, most current feature selection algorithms just work with the centralized methods. In this paper, we suggest a distributed version of the mRMR featu...
متن کاملClassification and Biomarker Genes Selection for Cancer Gene Expression Data Using Random Forest
Background & objective: Microarray and next generation sequencing (NGS) data are the important sources to find helpful molecular patterns. Also, the great number of gene expression data increases the challenge of how to identify the biomarkers associated with cancer. The random forest (RF) is used to effectively analyze the problems of large-p and smal...
متن کاملImproving Imbalanced data classification accuracy by using Fuzzy Similarity Measure and subtractive clustering
Classification is an one of the important parts of data mining and knowledge discovery. In most cases, the data that is utilized to used to training the clusters is not well distributed. This inappropriate distribution occurs when one class has a large number of samples but while the number of other class samples is naturally inherently low. In general, the methods of solving this kind of prob...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
ورودعنوان ژورنال:
- PeerJ PrePrints
دوره 5 شماره
صفحات -
تاریخ انتشار 2017